AN ELEMENTARY STATISTICAL APPROACH TO
MEASURING UNCERTAINTY IN A COST ESTIMATE RANGE


P.  R.  Garvey* 


The intent of this article is to suggest an approach to developing a probabilistic
cost range for a system in the conceptual design phase. This methodology
has been applied to a recent software cost study for a large-scale military
acquisition program; hence the emphasis in this paper is on the problem
of determining the most probable software development cost interval.


INTRODUCTION 

In  this  article  we  will  consider  a  system  as 
a  regularly  interacting  or  interdependent 
group  of  items  comprised  of  hardware  and/or 
software  elements  forming  a  unified  whole. 

In many large-scale command and control
projects, Prime Mission Product (PMP) cost estimates
developed for Full Scale Engineering
Development (FSED) are usually reported as a
range. They are therefore not necessarily intended
for budgetary purposes, but rather to provide
information to the respective Program Office
to aid in system engineering trade-off studies
and acquisition planning activities.


When a project is in concept definition, or
initial development, the precise determination
of system cost is usually not possible. For software-intensive
systems, estimates of Computer
Program Configuration Item (CPCI) size prior
to FSED are subjective, and are often based on
comparable software tasks or derived from advanced
prototype designs. The variability in a software
cost estimate is directly related to the variability
in CPCI size, which may range from a
low estimate a', through a most likely estimate m', to
a high estimate b' of lines of code (LOC):

LOC range: a' < m' < b'.

A  hierarchical  overview  of  the  procedures 
for  developing  a  software  cost  range  is  shown 
in  Figure  1. 


*The views and conclusions contained in this document are
those of the author and should not be interpreted as necessarily

Figure 1. System Overview


It is not my intention to address the inherent
estimation error associated with many of
the parametric or non-parametric software
cost models. Rather, the following discussion
attempts to provide a non-rigorous method to
measure the degree of uncertainty in a software
cost estimate range generated when only
subjective technical assessments of LOC are
available. Although this probability technique
was developed for treating software costs, it
can easily be extended to any other costs, such
as hardware, that are stated as a range.


THE  EXPECTED  COST 

In circumstances where it becomes necessary
to report a system cost as the most representative
point in the cost estimate range, a
useful measure of central tendency to determine
is the expected cost. The expected cost is
the center of gravity of the system
cost range. Mathematically, the expected cost
is defined by

\[ \mu = E(X) = \int_I x f(x)\,dx, \qquad I: [a \le x \le b] \]

where X denotes the cost random variable, and
f is the continuous probability density function
(pdf) of X. The integral limits a and b represent
respectively the low and high extrema in
the cost interval I. By definition, f(x) must satisfy
the following properties

\[ f(x) \ge 0 \text{ in } I, \qquad I: [a \le x \le b] \]

\[ \int_I f(x)\,dx = 1 \quad \text{for the continuous case} \]


Since the true underlying distribution of X is
unknown, a probability distribution of cost
may be expressed by choosing an appropriate
probability function that most accurately reflects
the unique system cost behavior. Define f̂
as the pdf that is the analyst's "best" subjective
approximation to the true underlying density
function f. Thus

\[ \hat{f} \approx f \]

We will further require the approximating
probability function to satisfy the boundary
conditions f̂(a) = 0 and f̂(b) = 0. The maximum
value of f̂ is defined by

\[ \hat{f}(m) = \max_{x \in I} \hat{f}(x), \qquad I: [a \le x \le b] \]


Several  classes  of  probability  functions  satisfy 
the  above  criteria.  This  article  will  consider 

•  A  polynomial  density  function 

•  A  triangular  density  function 

Expressions  for  their  means  and  variances  will 
be  derived. 


A  Polynomial  Density  Function 

Consider the situation where an analyst obtains
subjective values of a and b, but the most
likely value m is given not as a point but as a
percentage of the distance from the lower bound of I. Define a
4th-degree unimodal polynomial density function
g(x) by

\[ g(x) = \sum_{n=0}^{4} c_n x^n, \qquad I: [a \le x \le b] \]

which also satisfies the conditions g(a) = 0
and g(b) = 0. Further, assume there exists a
unique maximum point m contained in I such
that

\[ g(m) = \max_{x \in I} g(x) \]

The following discussion considers two distinct
functional variations of g(x). These forms, denoted
by g_j(x) (j = 1 or 2), are each uniquely
determined by the location of their mode m_j
(j = 1 or 2), where

\[ g_1(m_1) = \max_{x \in I} g_1(x), \qquad m_1 = 0.3(b - a) + a \]

\[ g_2(m_2) = \max_{x \in I} g_2(x), \qquad m_2 = 0.7(b - a) + a \]

Expressions for the mean and variance of g_j
will be derived.

To reduce the computational complexity
when computing the mean and variance of g_j,
transform the initial interval I

g_j(x) on I: [a, b], a ≤ x ≤ b

to the unit interval Z

p_j(z) on Z: [0, 1], 0 ≤ z ≤ 1

by the linear transformation

\[ z = \frac{x - a}{b - a} \]

and form a 4th-degree unimodal polynomial density
function, p_j(z) (j = 1 or 2), on the Z: [0, 1] interval
as shown in Figure 2.

Figure 2. A Polynomial Density Function

The value of g_j(m_j) (j = 1 or 2), when transformed
into the unit Z interval, occurs at the
points

\[ z_1 = \frac{(0.3(b - a) + a) - a}{b - a} = 0.3, \qquad z_2 = \frac{(0.7(b - a) + a) - a}{b - a} = 0.7 \]

as shown in Figure 2.


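The equations representing p_j (j = 1 or 2) are

\[ p_1(z) = \tfrac{15}{2}\,(1 - z)^2 \left[ 1 - (1 - z)^2 \right], \qquad 0 \le z \le 1 \]

\[ p_2(z) = \tfrac{15}{2}\, z^2 (1 - z^2), \qquad 0 \le z \le 1 \]

Note that the two densities are mirror images of each other
about z = 1/2, that is,

\[ p_1(z) = p_2(1 - z), \qquad 0 \le z \le 1 \]

The expected value of p_j (j = 1 or 2) is defined
by

\[ E(Z_j) = \int_Z z\, p_j(z)\,dz, \qquad Z: [0, 1] \]

and the variance σ_Zj² is computed from

\[ \sigma_{Z_j}^2 = E(Z_j^2) - E(Z_j)^2 \]

where

\[ E(Z_j^2) = \int_Z z^2 p_j(z)\,dz, \qquad Z: [0, 1] \]

On the unit interval Z: [0, 1] we then have

\[ E(Z_1) = \tfrac{3}{8}, \qquad E(Z_2) = \tfrac{5}{8} \]

\[ \sigma_{Z_j}^2 = \tfrac{17}{448} \quad \text{for each } j = 1 \text{ or } 2 \]

Mapping these values back into our original interval
I we have

\[ E(X_1) = \tfrac{1}{8}(5a + 3b), \qquad E(X_2) = \tfrac{1}{8}(3a + 5b) \]

\[ \sigma_{X_j}^2 = \tfrac{17}{448}\,(b - a)^2 \quad \text{for each } j = 1 \text{ or } 2 \]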

Note  that  these  expressions  for  the  expectation 
and  variance  are  explicitly  independent  of  m. 

As an application, suppose our cost interval
I is determined to be I: [$30, $50], where
a = $30 and b = $50. Further, if the "best" expert
assessment places the most likely value at
approximately 30% from the lower bound of I,
then we could use p_1 as our approximating pdf,
from which

E(X) = (1/8)(5a + 3b) = $37.5
σ_X = $3.9
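
The moments of the polynomial densities and the $30 to $50 example can be checked with exact rational arithmetic. The sketch below assumes the unit-interval densities are p_1(z) = (15/2)(1 - z)²[1 - (1 - z)²] and p_2(z) = (15/2) z²(1 - z²), where 15/2 is the constant that makes each integrate to one. The helper names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

# Coefficients c_n of z**n (n = 0..4) for the two unit-interval densities.
# Assumed normalizing constant: 15/2, which makes each density integrate to 1.
K = Fraction(15, 2)
P1 = [K * c for c in (0, 2, -5, 4, -1)]  # (15/2)(1-z)^2 [1-(1-z)^2], expanded
P2 = [K * c for c in (0, 0, 1, 0, -1)]   # (15/2) z^2 (1 - z^2)


def moment(coeffs, k):
    """Exact integral of z**k * p(z) over [0, 1] for a polynomial p."""
    return sum(Fraction(c) / (n + k + 1) for n, c in enumerate(coeffs))


def mean_and_variance(coeffs):
    """Return (E(Z), Var(Z)) on the unit interval."""
    mean = moment(coeffs, 1)
    return mean, moment(coeffs, 2) - mean ** 2


def to_cost_interval(mean_z, var_z, a, b):
    """Map unit-interval moments back to I = [a, b] via x = a + (b - a) z."""
    return a + (b - a) * mean_z, (b - a) ** 2 * var_z
```

For a = 30 and b = 50 with p_1, `to_cost_interval(*mean_and_variance(P1), 30, 50)` returns a mean of 75/2 and a variance whose square root is about 3.9, in agreement with the figures quoted in the example.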

The  Triangular  Density  Function 

When little information is available other
than subjective estimates of the extreme values
of the cost interval I, it is often convenient
to apply a triangular density function τ(x)
through the cost range. Classically, τ(x) has the
representation

\[ \tau(x) = \alpha^{-1}\left(1 - \alpha^{-1}|x|\right), \qquad |x| < \alpha \]  (Ref. 2)

and is symmetric about the origin in the interval
-α < x < α. For our purposes, we will define
a similar functional form, f_T, but one that
is bounded by x > 0 and satisfies the boundary
conditions f_T(a) = 0 and f_T(b) = 0, with

\[ f_T(m) = \max_{x \in I} f_T(x), \qquad I: [a \le x \le b] \]

Such a function is shown below in Figure 3.

Figure 3. A Triangular Density Function


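The probability density function f_T is
defined by

\[ f_T(x) = \begin{cases} \dfrac{c}{m - a}\,(x - a) & \text{if } 0 < a \le x \le m \\[6pt] \dfrac{c}{m - b}\,(x - b) & \text{if } m \le x \le b \end{cases} \]

where c, the peak value of the triangular density
(its height at the modal point m), is given by
c = 2/(b - a). Based
on this density, we can compute the first two
moments of f_T. They are

\[ \mu = E(X) = \int_I x f_T(x)\,dx = \begin{cases} \tfrac{1}{3}(a + m + b) & \text{if } m \text{ is known} \\[4pt] \tfrac{1}{2}(a + b) & \text{if } m \text{ is unknown} \end{cases} \]

\[ E(X^2) = \int_I x^2 f_T(x)\,dx = \frac{1}{6} \cdot \frac{1}{b - a} \left[ \frac{1}{m - a}\left(m^3(3m - 4a) + a^4\right) + \frac{1}{b - m}\left(m^3(3m - 4b) + b^4\right) \right] \]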

The cost variance, denoted by σ_X², is defined by

\[ \sigma_X^2 = E(X^2) - E(X)^2 \]

which reduces to

\[ \sigma_X^2 = \frac{1}{18}\left\{ (m - a)(m - b) + (b - a)^2 \right\} \]
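
The triangular moments can be cross-checked numerically. The following Python sketch evaluates the mean, the closed-form second moment, and the reduced variance expression, so that E(X²) - E(X)² can be compared against the 1/18 formula. The function names are hypothetical, and the mode m = 36 used in the check is an assumed value chosen only for illustration.

```python
def triangular_mean(a, b, m=None):
    """E(X) for the triangular density on [a, b].

    If the mode m is unknown (None), fall back to (a + b) / 2, which is the
    mean of the symmetric triangle with m at the midpoint of the interval.
    """
    if m is None:
        return (a + b) / 2
    return (a + m + b) / 3


def triangular_second_moment(a, b, m):
    """E(X^2) from the closed form; requires a < m < b."""
    left = (m ** 3 * (3 * m - 4 * a) + a ** 4) / (m - a)
    right = (m ** 3 * (3 * m - 4 * b) + b ** 4) / (b - m)
    return (left + right) / (6 * (b - a))


def triangular_variance(a, b, m):
    """Reduced variance expression: {(m - a)(m - b) + (b - a)^2} / 18."""
    return ((m - a) * (m - b) + (b - a) ** 2) / 18


# Illustrative check with an assumed mode of 36 on the interval [30, 50].
a, b, m = 30.0, 50.0, 36.0
mu = triangular_mean(a, b, m)
gap = triangular_second_moment(a, b, m) - mu ** 2 - triangular_variance(a, b, m)
```

Here `gap` should be zero up to floating-point rounding, which confirms that the two variance expressions agree for this choice of a, m, and b.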

These simple measures of central tendency
are useful for establishing the basis for a cost
estimate range when little specific information
regarding the nature of a system is available.
Measures of expectation, determined by the
pdf chosen, inform the analyst where the uncertainty
is greatest: skewed to the left or to
the right of the modal point. The variance σ_X²
can be used to establish a confidence criterion
on the bounds of a cost estimate range. The next
section applies these statistical measures to
the problem of establishing a conservative
probabilistic cost range based on information
obtained from μ and σ².

THE  CHEBYSHEV  BOUND 

The integrity of the software cost estimate
and any subsequent statistical inference depends
on the assumption that estimated
CPCI size adequately reflects reality. Under
this assumption, conservative probability statements
can be made about the likelihood that
the estimated cost range will capture the true
cost, that is, that the true cost lies within some Chebyshev
bound. In theory, the Chebyshev bound states
that the true value of the cost random variable
X differs from the expected cost μ by less
than k standard deviations (that is, by less than kσ), with probability
at least equal to 1 - 1/k², k > 1:

\[ \Pr\left( |X - \mu| < k\sigma \right) \ge 1 - \frac{1}{k^2} \]

No a priori assumption about the underlying
nature of the cost random variable X is
made other than that μ and σ exist. A sketch of
this cost range is shown below.

μ - kσ        μ        μ + kσ

Pr(|X - μ| < kσ) ≥ 1 - 1/k²
Figure 4. A Chebyshev Cost Range
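
The Chebyshev cost range is simple to compute once μ and σ are in hand. The Python sketch below returns the interval and its guaranteed minimum probability, and also inverts the bound to find the k needed for a desired confidence level. The function names are hypothetical; the inputs in the usage note are the $50,000 mean and $1,500 standard deviation of the article's hypothetical system.

```python
import math


def chebyshev_range(mean, sigma, k):
    """Return (low, high, min_probability) for Pr(|X - mean| < k * sigma)."""
    if k <= 1:
        raise ValueError("k must exceed 1 for a non-trivial bound")
    return mean - k * sigma, mean + k * sigma, 1.0 - 1.0 / k ** 2


def k_for_confidence(p):
    """Smallest k whose Chebyshev bound guarantees probability p (0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return 1.0 / math.sqrt(1.0 - p)
```

Calling `chebyshev_range(50000, 1500, 2)` gives the range 47000 to 53000 with a minimum probability of 0.75. Because the bound makes no distributional assumption, the range is conservative: for a specific density such as the triangular pdf, the actual coverage will generally be higher.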
Applied to a hypothetical system, suppose we
determine from a selected density function,
such as the polynomial or triangular pdf, that

E(X) = $50,000 and
σ_X = $1,500

then with an interval extending two standard
deviations on either side of the mean there is at least a
75% chance that the true subsystem cost will
fall in the cost range $47,000 - $53,000. That is,
using the Chebyshev inequality for k = 2,

Pr($47,000 < X < $53,000) ≥ 0.75

CONCLUSION

The problem of determining a three-point
approximation to a continuous random variable
with an unknown distribution is a popular
topic among researchers in the Management
Sciences area (Ref. 1). Some mean and variance
approximation algorithms are computationally
complex and require time-consuming computer
simulation.

Current research has yet to adequately develop
a procedure which models this problem.

The informal techniques outlined in this paper
support an analytical rationale for assessing
uncertainty in a cost estimate. These non-rigorous
procedures form the basis of a decision
tool that provides the analyst with a
method to make conservative probability statements
about cost intervals when only subjective
technical inputs are available.